INTERSPEECH.2011 - Language and Multimodal

Total: 145

#1 Perceptual learning of liquids [PDF] [Copy] [Kimi1]

Authors: Odette Scharenborg ; Holger Mitterer ; James M. McQueen

Previous research on lexically-guided perceptual learning has focussed on contrasts that differ primarily in local cues, such as plosive and fricative contrasts. The present research had two aims: to investigate whether perceptual learning occurs for a contrast with non-local cues, the /l/-/r/ contrast, and to establish whether STRAIGHT can be used to create ambiguous sounds on an /l/-/r/ continuum. Listening experiments showed lexically-guided learning about the /l/-/r/ contrast. Listeners can thus tune in to unusual speech sounds characterised by non-local cues. Moreover, STRAIGHT can be used to create stimuli for perceptual learning experiments, opening up new research possibilities.

#2 The efficiency of cross-dialectal word recognition [PDF] [Copy] [Kimi1]

Authors: Annelie Tuinman ; Holger Mitterer ; Anne Cutler

Dialects of the same language can differ in the casual speech processes they allow; e.g., British English allows the insertion of [r] at word boundaries in sequences such as saw ice, while American English does not. In two speeded word recognition experiments, American listeners heard such British English sequences; in contrast to non-native listeners, they accurately perceived intended vowel-initial words even with intrusive [r]. Thus despite input mismatches, cross-dialectal word recognition benefits from the full power of native-language processing.

#3 Estimation of perceptual spaces for speaker identities based on the cross-lingual discrimination task [PDF] [Copy] [Kimi1]

Authors: Minoru Tsuzaki ; Keiichi Tokuda ; Hisashi Kawai ; Jinfu Ni

This paper reconfirms that talker identity can be transmitted across languages. Talker discrimination was examined in an ABX paradigm, where stimuli A and B were utterances by different talkers in the same language and stimulus X was an utterance by either of the two talkers in the other language. The average hit rate in this discrimination task was as high as 0.89. Mutual distance matrices were generated from the discrimination index d, and three-dimensional perceptual spaces were estimated by multidimensional scaling. Features related to loudness and spectral centroid contributed most strongly to the perceptual dimensions.
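
The abstract does not give the exact formulas; as a rough illustration only, the sketch below (a hypothetical simplification, not the authors' pipeline) derives a d′-style discrimination index from hit and false-alarm rates, builds a talker distance matrix, and embeds it in three dimensions with multidimensional scaling.

```python
import numpy as np
from scipy.stats import norm
from sklearn.manifold import MDS

def d_prime(hit_rate, false_alarm_rate):
    """Signal-detection discrimination index from hit/false-alarm rates."""
    return norm.ppf(hit_rate) - norm.ppf(false_alarm_rate)

# Hypothetical ABX results for 4 talkers: hits[i, j] = hit rate when
# discriminating talker i from talker j across languages (values invented).
hits = np.array([[np.nan, 0.92, 0.88, 0.90],
                 [0.92, np.nan, 0.85, 0.91],
                 [0.88, 0.85, np.nan, 0.87],
                 [0.90, 0.91, 0.87, np.nan]])
false_alarms = 1.0 - hits  # simplistic symmetric assumption

dist = np.zeros_like(hits)
idx = ~np.isnan(hits)
dist[idx] = d_prime(hits[idx], false_alarms[idx])  # larger d' = more distinct

# 3-D embedding of the talker dissimilarities.
embedding = MDS(n_components=3, dissimilarity="precomputed",
                random_state=0).fit_transform(dist)
print(embedding.shape)  # (4, 3): one 3-D point per talker
```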

#4 The relation between perception and production in L2 phonological processing [PDF] [Copy] [Kimi1]

Authors: Sharon Peperkamp ; Camillia Bouchon

Seventeen French-English bilinguals read aloud a set of English sentences and performed an ABX discrimination task that assessed their perception of the English /I/-/i/ contrast. Global nativelikeness in production correlated with pronunciation accuracy for the vowels /I/ and /i/, and both production measures correlated with self-estimated pronunciation skills. However, performance on the perception task did not correlate with either global nativelikeness or /I,i/ pronunciation accuracy. These results are discussed in light of theories about the relation between perception and production in L2 phonological processing.

#5 The role of word-initial glottal stops in recognizing English words [PDF] [Copy] [Kimi1]

Authors: Maria Paola Bissiri ; Maria Luisa Garcia Lecumberri ; Martin Cooke ; Jan Volín

English word-initial vowels in natural continuous speech are optionally preceded by glottal stops or functionally equivalent glottalizations. It may be claimed that these glottal elements disturb the smooth flow of speech. However, they clearly mark word boundaries, which may potentially facilitate speech processing in the brain of the listener. The present study uses the word-monitoring paradigm to determine whether listeners react faster to words preceded by glottalization or to words without it. Three groups of subjects were compared: Czech and Spanish learners of English and native English speakers. The results indicate that perceptual use of glottalization for word segmentation is not entirely governed by universal rules and reflects the mother tongue of the listener as well as the status (L1/L2) of the target language.

#6 Effect of language experience on the categorical perception of Cantonese vowel duration [PDF] [Copy] [Kimi1]

Authors: Caicai Zhang ; Gang Peng ; William S.-Y. Wang

This study investigated the effect of language experience on the categorical perception of the Cantonese vowel duration distinction. By comparing Cantonese and Mandarin listeners' performance, we found that: (1) duration change elicited categorical perception in Cantonese listeners, but not in Mandarin listeners; (2) Cantonese listeners were affected by vowel quality differences, whereas Mandarin subjects were generally unbiased with respect to them; (3) the effect of duration was overridden by vowel quality in the [a] condition for Cantonese listeners. Our findings suggest that vowel quality is incorporated as a phonological cue in Cantonese.

#7 Novel VTEO based mel cepstral features for classification of normal and pathological voices [PDF] [Copy] [Kimi1]

Authors: Hemant A. Patil ; Pallavi N. Baljekar

In this paper, novel Variable-length Teager Energy Operator (VTEO) based Mel cepstral features, viz. VTMFCC, are proposed for the automatic classification of normal and pathological voices. Experiments have been carried out using the proposed feature set, MFCC, and their score-level fusion. Classification was performed using a 2nd-order polynomial classifier on a subset of the MEEI database. The equal error rate (EER) on fusion was 3.2% less than the EER of MFCC alone, which was used as the baseline. The effectiveness of the proposed feature set was also investigated under degraded conditions using the NOISEX-92 database for babble and high-frequency channel noise.
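
The operator itself is not spelled out in the abstract. A common formulation of the variable-length Teager energy operator, assumed here for illustration, generalizes the classic Teager operator Ψ[x(n)] = x(n)² − x(n−1)x(n+1) to a dependence index i, giving Ψ_i[x(n)] = x(n)² − x(n−i)x(n+i); the sketch below computes this profile as a plausible first stage before Mel filtering and cepstral analysis.

```python
import numpy as np

def vteo(x, i=1):
    """Variable-length Teager energy: psi_i[x(n)] = x(n)^2 - x(n-i)*x(n+i).

    i=1 recovers the classic Teager energy operator; larger i captures
    longer-range dependencies (the 'variable length' in VTEO).
    """
    x = np.asarray(x, dtype=float)
    e = np.zeros_like(x)
    e[i:-i] = x[i:-i] ** 2 - x[:-2 * i] * x[2 * i:]
    return e

# Toy example: 100 ms of a 200 Hz tone sampled at 16 kHz.
fs = 16000
t = np.arange(int(0.1 * fs)) / fs
signal = np.sin(2 * np.pi * 200 * t)
energy = vteo(signal, i=2)
# 'energy' would then be framed, Mel-filtered and DCT-ed to obtain
# VTEO-based Mel cepstral features (VTMFCC in the authors' terminology).
print(energy[:5])
```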

#8 Temporal performance of dysarthric patients in speech and tapping tasks [PDF] [Copy] [Kimi1]

Authors: Eiji Shimura ; Kazuhiko Kakehi

Dysarthria is a motor disorder of the speech organs caused by pathological changes in the nerve and muscle systems. Several methods of speaking-rate control have been widely used for the rehabilitation of dysarthria; however, these methods are not always effective, depending on the condition of the dysarthric patient. In this study, we investigated the tempo perception performance of dysarthric patients, which has not yet been fully studied. Several types of experiments were conducted with both dysarthric patients and normal subjects. The experiments included speech production and tapping tasks with and without reference samples of utterances or tapping.

#9 A comparative acoustic study on speech of glossectomy patients and normal subjects [PDF] [Copy] [Kimi1]

Authors: Xinhui Zhou ; Maureen Stone ; Carol Y. Espy-Wilson

Oral, head and neck cancer represents 3% of all cancers in the United States and is the 6th most common cancer worldwide. Tongue cancer patients are treated by glossectomy, a surgical procedure to remove the cancerous tumor. As a result, tongue properties such as volume, shape, muscle structure, and motility are affected, and so are the vocal tract acoustics. This study compares the speech acoustics of normal subjects and partial glossectomy patients with T1 or T2 tumors. The acoustic signals of four vowels (/iy/, /uw/, /eh/, and /ah/) and two fricatives (/s/ and /sh/) were analyzed. Our results show that, while the average formants (F1-F3) of the four vowels are very similar between the normal subjects and the glossectomy patients, the average centers of gravity of the two fricatives differ significantly. These differences in the fricatives can be explained by the more posterior constriction in patients, due to the glossectomy (or the cancer tumor), and the resulting longer front cavity.
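
The spectral center of gravity used for the fricatives is the amplitude-weighted mean frequency of the spectrum; a minimal sketch is shown below, assuming a simple magnitude-spectrum weighting that may differ in detail from the authors' exact measure.

```python
import numpy as np

def spectral_center_of_gravity(frame, fs):
    """Amplitude-weighted mean frequency (Hz) of one windowed frame."""
    frame = np.asarray(frame, dtype=float) * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    return np.sum(freqs * spectrum) / np.sum(spectrum)

# Toy check on white noise; real /s/ frames concentrate energy higher
# in frequency than /sh/ frames, giving a higher center of gravity.
rng = np.random.default_rng(0)
fs = 16000
noise = rng.standard_normal(1024)
print(spectral_center_of_gravity(noise, fs))  # roughly fs/4 = 4 kHz for white noise
```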

#10 Dysperiodicity analysis of perceptually assessed synthetic speech stimuli [PDF] [Copy] [Kimi1]

Authors: Ali Alpan ; Francis Grenez ; Jean Schoentgen

The objective is to analyze vocal dysperiodicities in perceptually assessed synthetic speech sounds. The analysis involves a variogram-based method that enables tracking of instantaneous vocal dysperiodicities. The dysperiodicity trace is summarized by means of the signal-to-dysperiodicity ratio, which has been shown to correlate strongly with the perceived degree of hoarseness of the speaker. The stimuli were generated by a synthesizer of disordered voices that has been shown to produce natural-sounding speech fragments comprising diverse vocal perturbations, and they were perceptually assessed by nine listeners according to grade, breathiness and roughness. In previous studies, signal-to-dysperiodicity ratios were correlated with perceived degrees of hoarseness; the objective here is to extend the analysis to roughness and breathiness. A second objective is to analyze the dependence of the signal-to-dysperiodicity ratio on the signal properties fixed by the synthesizer parameters. Results show a good correlation between signal-to-dysperiodicity ratios and perceptual scores. At most two frequency bands are necessary to predict the perceptual scores. Additive noise contributes most, followed by jitter; the interactions between noise parameters, vocal frequency and vowel category contribute moderately or weakly.
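
As a rough illustration of the summary statistic only, the sketch below computes a signal-to-dysperiodicity ratio by removing, for each candidate lag in a plausible pitch range, the best cycle-to-cycle prediction and comparing signal and residual energies in decibels. The lag search and the dB form are assumptions for illustration; the authors' variogram-based estimator is more elaborate.

```python
import numpy as np

def signal_to_dysperiodicity_ratio(x, fs, f0_min=60.0, f0_max=400.0):
    """Crude SDR (dB): signal energy over the energy of the residual left
    after subtracting the best one-period-delayed copy of the signal."""
    x = np.asarray(x, dtype=float)
    lags = np.arange(int(fs / f0_max), int(fs / f0_min) + 1)
    best_residual, best_energy = None, np.inf
    for lag in lags:
        residual = x[lag:] - x[:-lag]      # dysperiodicity trace for this lag
        energy = np.mean(residual ** 2)
        if energy < best_energy:
            best_energy, best_residual = energy, residual
    signal_energy = np.mean(x ** 2)
    return 10.0 * np.log10(signal_energy / (best_energy + 1e-12)), best_residual

# Toy example: a periodic 150 Hz tone gives a high SDR; added noise lowers it.
fs = 16000
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 150 * t)
noisy = clean + 0.05 * np.random.default_rng(1).standard_normal(len(clean))
print(signal_to_dysperiodicity_ratio(clean, fs)[0])
print(signal_to_dysperiodicity_ratio(noisy, fs)[0])
```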

#11 Is the perception of voice quality language-dependent? a comparison of French and Italian listeners and dysphonic speakers [PDF] [Copy] [Kimi1]

Authors: Alain Ghio ; Frédérique Weisz ; Giovanna Baracca ; Giovanna Cantarella ; Danièle Robert ; Virginie Woisard ; Franco Fussi ; Antoine Giovanni

We present an experiment in which the voice quality of French and Italian dysphonic speakers was evaluated by French and Italian listeners who are specialists in phoniatrics. Results showed that both groups of speakers were perceived in the same way by the two groups of listeners in terms of overall severity and breathiness. However, the perception of roughness is clearly language-dependent: Italian listeners underestimate roughness compared to French listeners. Linking these perceptual results with measures obtained in speech production, we can hypothesize that this reflects a perception/production adaptation process.

#12 Automatic selection of acoustic and non-linear dynamic features in voice signals for hypernasality detection [PDF] [Copy] [Kimi1]

Authors: J. R. Orozco-Arroyave ; S. Murillo-Rendón ; A. M. Álvarez-Meza ; J. D. Arias-Londoño ; E. Delgado-Trejos ; J. F. Vargas-Bonilla ; C. G. Castellanos-Domínguez

Automatic detection of hypernasality in the voices of children with Cleft Lip and Palate (CLP) is performed using two characterization techniques, one based on acoustic, noise and cepstral analysis and the other based on nonlinear dynamic features. Besides characterization, two automatic feature selection techniques are implemented in order to find optimal sub-spaces that better discriminate between healthy and hypernasal voices. Results indicate that nonlinear dynamic features are a valuable tool for the automatic detection of hypernasality; additionally, both feature selection techniques show stable and consistent results, achieving accuracy levels of up to 93.73%.

#13 Asynchronous multimodal text entry using speech and gesture keyboards [PDF] [Copy] [Kimi1]

Authors: Per Ola Kristensson ; Keith Vertanen

We propose reducing errors in text entry by combining speech and gesture keyboard input. We describe a merge model that combines recognition results in an asynchronous and flexible manner. We collected speech and gesture data of users entering both short email sentences and web search queries. By merging recognition results from both modalities, word error rate was reduced by 53% relative for email sentences and 29% relative for web searches. For email utterances with speech errors, we investigated providing gesture keyboard corrections of only the erroneous words. Without the user explicitly indicating the incorrect words, our model was able to reduce the word error rate by 44% relative.

#14 Robust bimodal person identification using face and speech with limited training data and corruption of both modalities [PDF] [Copy] [Kimi1]

Authors: Niall McLaughlin ; Ji Ming ; Danny Crookes

This paper presents a novel method of audio-visual fusion for person identification where both the speech and facial modalities may be corrupted, and there is a lack of prior knowledge about the corruption. Furthermore, we assume there is a limited amount of training data for each modality (e.g., a short training speech segment and a single training facial image for each person). A new representation and a modified cosine similarity are introduced for combining and comparing bimodal features with limited training data as well as vastly differing data rates and feature sizes. Optimal feature selection and multicondition training are used to reduce the mismatch between training and testing, thereby making the system robust to unknown bimodal corruption. Experiments have been carried out on a bimodal data set created from the SPIDRE and AR databases with variable noise corruption of speech and occlusion in the face images. The new method has demonstrated improved recognition accuracy.
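
The paper's modified cosine similarity is not specified in the abstract; as a hypothetical baseline only, the sketch below length-normalizes the audio and facial feature vectors separately before concatenation so that neither modality dominates a plain cosine score purely through dimensionality or scale.

```python
import numpy as np

def bimodal_cosine(audio_a, face_a, audio_b, face_b, w_audio=0.5):
    """Cosine similarity of concatenated audio/face features, with each
    modality unit-normalized and weighted so scale and dimension differences
    between the modalities do not dominate the score."""
    def unit(v):
        v = np.asarray(v, dtype=float).ravel()
        return v / (np.linalg.norm(v) + 1e-12)

    a = np.concatenate([np.sqrt(w_audio) * unit(audio_a),
                        np.sqrt(1 - w_audio) * unit(face_a)])
    b = np.concatenate([np.sqrt(w_audio) * unit(audio_b),
                        np.sqrt(1 - w_audio) * unit(face_b)])
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy usage with invented feature sizes (e.g. 39-dim speech, 1024-dim face codes).
rng = np.random.default_rng(0)
enrol = (rng.standard_normal(39), rng.standard_normal(1024))
test = (enrol[0] + 0.1 * rng.standard_normal(39),
        enrol[1] + 0.1 * rng.standard_normal(1024))
print(bimodal_cosine(test[0], test[1], enrol[0], enrol[1]))  # close to 1.0
```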

#15 Toward a multi-speaker visual articulatory feedback system [PDF] [Copy] [Kimi1]

Authors: Atef Ben Youssef ; Thomas Hueber ; Pierre Badin ; Gérard Bailly

In this paper, we present recent developments on the HMM-based acoustic-to-articulatory inversion approach that we develop for a "visual articulatory feedback" system. In this approach, multistream phoneme HMMs are trained jointly on synchronous streams of acoustic and articulatory data, acquired by electromagnetic articulography (EMA). Acoustic-to-articulatory inversion is achieved in two steps. Phonetic and state decoding is first performed. Then articulatory trajectories are inferred from the decoded phone and state sequence using the maximum-likelihood parameter generation algorithm (MLPG). We introduce here a new procedure for the re-estimation of the HMM parameters, based on the Minimum Generation Error criterion (MGE). We also investigate the use of model adaptation techniques based on maximum likelihood linear regression (MLLR), as a first step toward a multi-speaker visual articulatory feedback system.

#16 Statistical mapping between articulatory and acoustic data for an ultrasound-based silent speech interface [PDF] [Copy] [Kimi1]

Authors: Thomas Hueber ; Elie-Laurent Benaroya ; Bruce Denby ; Gérard Chollet

This paper presents recent developments on our "silent speech interface" that converts tongue and lip motions, captured by ultrasound and video imaging, into audible speech. In our previous studies, the mapping between the observed articulatory movements and the resulting speech sound was achieved using a unit selection approach. We investigate here the use of statistical mapping techniques, based on the joint modeling of visual and spectral features, using respectively Gaussian Mixture Models (GMM) and Hidden Markov Models (HMM). The prediction of the voiced/unvoiced parameter from visual articulatory data is also investigated using an artificial neural network (ANN). A continuous-speech database consisting of one hour of high-speed ultrasound and video sequences was specifically recorded to evaluate the proposed mapping techniques.
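
For the GMM part of the mapping, a standard formulation (sketched below under the assumption of a joint full-covariance GMM and minimum-mean-square-error regression, which may differ in detail from the authors' system) fits a mixture on stacked visual/spectral vectors and predicts spectral features as the posterior-weighted conditional means.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(visual, spectral, n_components=8):
    """Fit a full-covariance GMM on stacked [visual; spectral] frames."""
    joint = np.hstack([visual, spectral])
    return GaussianMixture(n_components=n_components,
                           covariance_type="full", random_state=0).fit(joint)

def gmm_regression(gmm, x, dim_x):
    """MMSE estimate of the spectral part given a visual vector x."""
    means_x = gmm.means_[:, :dim_x]
    means_y = gmm.means_[:, dim_x:]
    cov_xx = gmm.covariances_[:, :dim_x, :dim_x]
    cov_yx = gmm.covariances_[:, dim_x:, :dim_x]
    # Posterior responsibility of each mixture component given x.
    log_post = np.array([np.log(w) + multivariate_normal.logpdf(x, m, c)
                         for w, m, c in zip(gmm.weights_, means_x, cov_xx)])
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    # Per-component conditional means E[y | x, k], mixed by the posteriors.
    cond = [m_y + c_yx @ np.linalg.solve(c_xx, x - m_x)
            for m_x, m_y, c_xx, c_yx in zip(means_x, means_y, cov_xx, cov_yx)]
    return np.sum(post[:, None] * np.array(cond), axis=0)

# Toy usage with invented dimensions (e.g. 30-dim visual, 25-dim spectral frames).
rng = np.random.default_rng(0)
visual = rng.standard_normal((2000, 30))
spectral = visual[:, :25] * 0.5 + 0.1 * rng.standard_normal((2000, 25))
gmm = fit_joint_gmm(visual, spectral)
print(gmm_regression(gmm, visual[0], dim_x=30).shape)  # (25,)
```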

#17 Unsupervised geometry calibration of acoustic sensor networks using source correspondences [PDF] [Copy] [Kimi1]

Authors: Joerg Schmalenstroeer ; Florian Jacob ; Reinhold Haeb-Umbach ; Marius H. Hennecke ; Gernot A. Fink

In this paper we propose a procedure for estimating the geometric configuration of an arbitrary acoustic sensor placement. It determines the position and the orientation of microphone arrays in 2D while locating a source by direction-of-arrival (DoA) estimation. Neither artificial calibration signals nor unnatural user activity are required. The problem of scale indeterminacy inherent to DoA-only observations is solved by adding time difference of arrival (TDoA) measurements. The geometry calibration method is numerically stable and delivers precise results in moderately reverberated rooms. Simulation results are confirmed by laboratory experiments.
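
To make the DoA/TDoA relationship concrete, a minimal far-field sketch (a textbook relation, not the paper's full calibration procedure) recovers a direction of arrival from the time difference measured between two microphones; in the paper, DoA observations fix the geometry only up to scale, and TDoA measurements supply the missing absolute distances.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s at roughly 20 degrees Celsius

def doa_from_tdoa(tdoa, mic_spacing):
    """Far-field direction of arrival (radians) from the time difference of
    arrival between two microphones a known distance apart."""
    return np.arcsin(np.clip(SPEED_OF_SOUND * tdoa / mic_spacing, -1.0, 1.0))

# A source 30 degrees off broadside of a 10 cm two-microphone pair produces
# a TDoA of d*sin(theta)/c, roughly 146 microseconds.
true_theta = np.deg2rad(30.0)
tdoa = 0.10 * np.sin(true_theta) / SPEED_OF_SOUND
print(np.rad2deg(doa_from_tdoa(tdoa, 0.10)))  # ~30.0
```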

#18 Investigations on speaking mode discrepancies in EMG-based speech recognition [PDF] [Copy] [Kimi1]

Authors: Michael Wand ; Matthias Janke ; Tanja Schultz

In this paper we present our recent study on the impact of speaking mode variabilities on speech recognition by surface electromyography (EMG). Surface electromyography captures the electric potentials of the human articulatory muscles, which enables a user to communicate naturally without making any audible sound. Our previous experiments have shown that the EMG signal varies greatly between different speaking modes, like audibly uttered speech and silently articulated speech. In this study we extend our previous research and quantify the impact of different speaking modes by investigating the amount of mode-specific leaves in phonetic decision trees. We show that this measure correlates highly with discrepancies in the spectral energy of the EMG signal, as well as with differences in the performance of a recognizer on different speaking modes. We furthermore present how EMG signal adaptation by spectral mapping decreases the effect of the speaking mode.

#19 Multi-task learning for spoken language understanding with shared slots [PDF] [Copy] [Kimi1]

Authors: Xiao Li ; Ye-Yi Wang ; Gokhan Tur

This paper addresses the problem of learning multiple spoken language understanding (SLU) tasks that have overlapping sets of slots. In such a scenario, it is possible to achieve better slot filling performance by learning multiple tasks simultaneously, as opposed to learning them independently. We focus on presenting a number of simple multi-task learning algorithms for slot filling systems based on semi-Markov CRFs, assuming the knowledge of shared slots. Furthermore, we discuss an intra-domain clustering method that automatically discovers shared slots from training data. The effectiveness of our proposed approaches is demonstrated in an SLU application that involves three different yet related tasks.

#20 Learning weighted entity lists from web click logs for spoken language understanding [PDF] [Copy] [Kimi1]

Authors: Dustin Hillard ; Asli Celikyilmaz ; Dilek Hakkani-Tür ; Gokhan Tur

Named entity lists provide important features for language understanding, but typical lists can contain many ambiguous or incorrect phrases. We present an approach for automatically learning weighted entity lists by mining user clicks from web search logs. The approach significantly outperforms multiple baseline approaches and the weighted lists improve spoken language understanding tasks such as domain detection and slot filling. Our methods are general and can be easily applied to large quantities of entities, across any number of lists.
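
The abstract only outlines the mining step; a minimal sketch of the underlying idea, with invented field names and a simple click-fraction weighting rather than the authors' exact statistics, scores each candidate entity phrase by how often its clicks land on URLs already associated with the target domain.

```python
from collections import defaultdict

# Hypothetical click-log records: (query, clicked_url) pairs.
click_log = [
    ("the dark knight", "imdb.com/title/tt0468569"),
    ("the dark knight", "imdb.com/title/tt0468569"),
    ("the dark knight", "wikipedia.org/wiki/The_Dark_Knight"),
    ("orange", "wikipedia.org/wiki/Orange_(fruit)"),
    ("orange", "imdb.com/title/tt0118589"),
]
movie_hosts = {"imdb.com"}  # hosts taken as evidence for the 'movie' entity list

def weighted_entity_list(log, domain_hosts):
    """Weight = fraction of a query's clicks that land on in-domain hosts."""
    total = defaultdict(int)
    in_domain = defaultdict(int)
    for query, url in log:
        total[query] += 1
        if url.split("/")[0] in domain_hosts:
            in_domain[query] += 1
    return {q: in_domain[q] / total[q] for q in total}

print(weighted_entity_list(click_log, movie_hosts))
# {'the dark knight': ~0.67, 'orange': 0.5} -- ambiguous phrases get lower weight
```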

#21 Bootstrapping domain detection using query click logs for new domains [PDF] [Copy] [Kimi1]

Authors: Dilek Hakkani-Tür ; Gokhan Tur ; Larry Heck ; Elizabeth Shriberg

Domain detection in spoken dialog systems is usually treated as a multi-class, multi-label classification problem, and training of domain classifiers requires collection and manual annotation of example utterances. In order to extend a dialog system to new domains in a way that is seamless for users, domain detection should be able to handle utterances from the new domain as soon as it is introduced. In this work, we propose using web search query logs, which include queries entered by users and the links they subsequently click on, to bootstrap domain detection for new domains. While sampling user queries from the query click logs to train new domain classifiers, we introduce two types of measures based on the behavior of the users who entered a query and the form of the query. We show that both types of measures result in reductions in the error rate as compared to randomly sampling training queries. In controlled experiments over five domains, we achieve the best gain from the combination of the two types of sampling criteria.

#22 Approximate inference for domain detection in spoken language understanding [PDF] [Copy] [Kimi1]

Authors: Asli Celikyilmaz ; Dilek Hakkani-Tür ; Gokhan Tur

This paper presents a semi-latent topic model for semantic domain detection in spoken language understanding systems. We use labeled utterance information to capture latent topics, which directly correspond to semantic domains. Additionally, we introduce an 'informative prior' for Bayesian inference that can simultaneously segment utterances of known domains into classes and separate them from out-of-domain utterances. We show that our model generalizes well on the task of classifying spoken language utterances and compare its results to those of an unsupervised topic model, which does not use labeled information.

#23 Speech indexing using semantic context inference [PDF] [Copy] [Kimi1]

Authors: Chien-Lin Huang ; Bin Ma ; Haizhou Li ; Chung-Hsien Wu

This study presents a novel approach to spoken document retrieval based on semantic context inference for speech indexing. Each recognized term in a spoken document is mapped onto a semantic inference vector containing a bag of semantic terms through a semantic relation matrix. The semantic context inference vector is then constructed by summing up all the semantic inference vectors. Such semantic term expansion and re-weighting make the semantic context inference vector a suitable representation for speech indexing. The experiments were conducted on 1550 anchor news stories collected from 198 hours of Mandarin Chinese broadcast news. The experimental results indicate that the proposed speech indexing using semantic context inference contributes to a substantial performance improvement in spoken document retrieval.
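
Following the description above, a minimal sketch of the indexing step, using a toy vocabulary and a hypothetical relation matrix rather than the paper's learned one, maps each recognized term through a semantic relation matrix and sums the resulting vectors into a single semantic context inference vector for the document.

```python
import numpy as np

vocab = ["election", "vote", "ballot", "goal", "match"]
term_index = {t: i for i, t in enumerate(vocab)}

# Hypothetical semantic relation matrix R: R[i, j] = strength with which
# term j activates semantic term i (rows and columns over the same vocabulary).
R = np.array([
    [1.0, 0.8, 0.7, 0.0, 0.0],
    [0.8, 1.0, 0.6, 0.0, 0.0],
    [0.7, 0.6, 1.0, 0.0, 0.1],
    [0.0, 0.0, 0.0, 1.0, 0.7],
    [0.0, 0.0, 0.1, 0.7, 1.0],
])

def semantic_context_vector(recognized_terms):
    """Sum the per-term semantic inference vectors (columns of R)."""
    v = np.zeros(len(vocab))
    for term in recognized_terms:
        if term in term_index:
            v += R[:, term_index[term]]
    return v

doc_vector = semantic_context_vector(["election", "vote", "vote"])
query_vector = semantic_context_vector(["ballot"])
score = doc_vector @ query_vector / (np.linalg.norm(doc_vector)
                                     * np.linalg.norm(query_vector))
print(round(float(score), 3))  # high similarity despite no exact term overlap
```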

#24 Automatically optimizing utterance classification performance without human in the loop [PDF] [Copy] [Kimi1]

Authors: Yun-Cheng Ju ; Jasha Droppo

The Utterance Classification (UC) method has become a developer's choice over traditional Context-Free Grammars (CFGs) for voice menus in telephony applications. This data-driven method achieves higher accuracy and has great potential to utilize huge amounts of labeled training data. However, having a human manually label the training data can be expensive. This paper provides a robust recipe for training a UC system using inexpensive acoustic data with limited transcriptions or semantic labels. It also describes two new algorithms that use caller confirmations, which occur naturally within a dialog, to generate pseudo semantic labels. Experimental results show that, once there is sufficient labeled data to achieve reasonable accuracy, both of our algorithms can use unlabeled data to reach the same performance as a system trained with labeled data, while completely eliminating the need for human supervision.

#25 The multi timescale phoneme acquisition model of the self-organizing based on the dynamic features [PDF] [Copy] [Kimi1]

Authors: Kouki Miyazawa ; Hideaki Miura ; Hideaki Kikuchi ; Reiko Mazuka

It is unclear how infants learn the acoustic realization of each phoneme of their native language. Recent studies have examined phoneme acquisition using computational models, but these studies have used a limited vocabulary as input and do not handle continuous speech comparable to a natural environment. We therefore use natural continuous speech and build a self-organizing model that simulates human cognitive abilities, and we analyze the quality and quantity of speech information necessary for acquiring the native phoneme system. Our model is designed to learn the acoustic feature values of continuous speech and to estimate the number and boundaries of the phoneme categories without explicit instruction. In a previous study, our model was able to acquire the vowels of the input language in detail. In this study, we examined the mechanism necessary for an infant to acquire all the phonemes of a language, including consonants. In natural speech, vowels have stationary features, so our previous model is well suited to learning them; learning consonants with that model is difficult, however, because most consonants have more dynamic features than vowels. To solve this problem, we designed a method that separates "stable" and "dynamic" speech patterns using a feature-extraction method based on human auditory representations. Using this method, we showed that unstable phonemes can be acquired without explicit instruction.